🎨Computational backend: Make sure the number of threads of a dask-worker is computed for autoscaling 🚨🚨🚨 #8423

sanderegg · 2025-09-25T09:59:54Z

What do these changes do?

When running in computational mode, the autoscaling service uses the dask client to work out which EC2 instances are needed.

In billable systems, the hardware is specified with the task, and autoscaling checks that the task fits.
In non-billable systems, the required resources (e.g. CPUs, RAM, ...) are specified with the task, and autoscaling picks the most suitable EC2 instance based on those resources.

For non-billable systems:
Until this PR, the autoscaling service estimated the resources available on an EC2 instance from what the AWS EC2 API returns, which is not exactly what Docker, or even Dask, sees once the instance is up and running.

This sometimes created deadlocks: a machine that should in theory handle the task could not, because the Docker engine and/or the dask worker "sees" slightly less memory or fewer CPUs. This PR corrects this by using the same computations everywhere.
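
To illustrate the mismatch described above, here is a minimal sketch. The figures and names (`Capacity`, `fits`, `advertised`, `seen_by_worker`) are hypothetical and not taken from this PR; they only show why sizing against the cloud API numbers can select a machine that never accepts the task.

```python
from dataclasses import dataclass

GIB = 1024**3


@dataclass(frozen=True)
class Capacity:
    cpus: float
    ram: int  # bytes


def fits(required: Capacity, available: Capacity) -> bool:
    """True when every required quantity is covered by what is available."""
    return required.cpus <= available.cpus and required.ram <= available.ram


# What the cloud API advertises for a (hypothetical) instance type.
advertised = Capacity(cpus=4, ram=16 * GIB)
# What the worker might see after OS/Docker overhead (made-up figures).
seen_by_worker = Capacity(cpus=3.9, ram=15 * GIB)

task = Capacity(cpus=4, ram=16 * GIB)

planned_ok = fits(task, advertised)    # autoscaling believes the machine is suitable
runnable = fits(task, seen_by_worker)  # the worker cannot actually take the task
print(planned_ok, runnable)
```

Using one shared computation for both sides, as the PR description proposes, removes the gap between the two checks.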

**Computational provider (a.k.a. dask)**
Every dask-worker has a defined number of threads, i.e. the theoretical number of jobs that can run in parallel.
In our implementation, the dask-worker either uses the number of CPUs it finds, overrides it with the DASK_NTHREADS environment variable, or applies the DASK_NTHREADS_MULTIPLIER environment variable.

This means that even if a user wants to run 50 tasks requiring 0.1 CPU each on a machine with 20 CPUs, no more than nthreads tasks can run in parallel.
This PR lets the autoscaling service understand this concept through so-called "generic resources". This also opens the door to supporting GPUs and any other kind of resource.
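
The thread cap can be sketched as follows. This is a minimal illustration, not the code from this PR: the helper names are hypothetical, and it assumes that an unset or zero DASK_NTHREADS means "use the detected CPUs" and that DASK_NTHREADS_MULTIPLIER scales the detected CPU count.

```python
import math
from typing import Optional


def compute_nthreads(cpu_count: int, nthreads: Optional[int] = None, multiplier: int = 1) -> int:
    """Hypothetical helper: number of threads a worker would expose."""
    if nthreads:  # explicit override (e.g. DASK_NTHREADS); None or 0 means "not set"
        return nthreads
    # Assumption: the multiplier scales the detected CPU count.
    return max(1, cpu_count * multiplier)


def max_parallel_tasks(cpu_count: int, cpus_per_task: float, nthreads: int) -> int:
    """Tasks that can run at once: limited by CPU share and by the thread count."""
    by_cpu = math.floor(cpu_count / cpus_per_task)
    return min(by_cpu, nthreads)


default_threads = compute_nthreads(cpu_count=20)
# With 0.1 CPU per task the CPU share alone would allow far more tasks,
# but the result is capped by the number of worker threads.
print(max_parallel_tasks(20, 0.1, default_threads))
```

Treating the thread count as just another resource that each task consumes is what lets the same fit check cover this limit.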

🚨🚨🚨 Take some caution on deployment to ensure everything runs as smoothly as possible.

Related issue/s

fixes Autoscaling: in non-billable systems the chosen machine type does not take in account the removed resources as the dask-sidecar does #6320

How to test

Dev-ops

codecov · 2025-09-25T10:11:23Z

Codecov Report

❌ Patch coverage is 96.34703% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 87.44%. Comparing base (379e430) to head (f2bcb52).
⚠️ Report is 1 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #8423      +/-   ##
==========================================
+ Coverage   87.01%   87.44%   +0.43%     
==========================================
  Files        2011     1604     -407     
  Lines       78602    66901   -11701     
  Branches     1348      761     -587     
==========================================
- Hits        68392    58499    -9893     
+ Misses       9807     8149    -1658     
+ Partials      403      253     -150

Flag	Coverage Δ
integrationtests	`63.96% <28.57%> (+3.57%)`	⬆️
unittests	`85.94% <96.34%> (-0.30%)`	⬇️

Components	Coverage Δ
pkg_aws_library	`94.98% <100.00%> (+1.37%)`	⬆️
pkg_celery_library	`∅ <ø> (∅)`
pkg_dask_task_models_library	`79.00% <76.92%> (-0.34%)`	⬇️
pkg_models_library	`∅ <ø> (∅)`
pkg_notifications_library	`∅ <ø> (∅)`
pkg_postgres_database	`∅ <ø> (∅)`
pkg_service_integration	`∅ <ø> (∅)`
pkg_service_library	`70.96% <62.50%> (-0.01%)`	⬇️
pkg_settings_library	`∅ <ø> (∅)`
pkg_simcore_sdk	`84.77% <ø> (-0.24%)`	⬇️
agent	`93.10% <ø> (ø)`
api_server	`91.62% <ø> (ø)`
autoscaling	`95.83% <99.21%> (+0.83%)`	⬆️
catalog	`92.06% <ø> (ø)`
clusters_keeper	`99.14% <ø> (ø)`
dask_sidecar	`92.38% <ø> (ø)`
datcore_adapter	`97.95% <ø> (ø)`
director	`75.81% <ø> (ø)`
director_v2	`91.02% <100.00%> (+5.70%)`	⬆️
dynamic_scheduler	`96.66% <ø> (ø)`
dynamic_sidecar	`90.37% <ø> (-0.07%)`	⬇️
efs_guardian	`89.83% <ø> (ø)`
invitations	`90.90% <ø> (ø)`
payments	`92.80% <ø> (ø)`
resource_usage_tracker	`92.22% <ø> (ø)`
storage	`86.56% <ø> (-0.37%)`	⬇️
webclient	`∅ <ø> (∅)`
webserver	`87.05% <66.66%> (-0.02%)`	⬇️

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 379e430...f2bcb52. Read the comment docs.


mergify · 2025-09-25T10:20:56Z

🧪 CI Insights

Here's what we observed from your CI run for f2bcb52.

✅ Passed Jobs With Interesting Signals

Pipeline	Job	Signal	Health on `master`	Retries	🔍 CI Insights	📄 Logs
`CI`	`system-tests`	Base branch is broken, but retries were needed. Could be early signs of flakiness 👀		`1`	View	View

Copilot

Pull Request Overview

This PR enhances the computational backend to properly compute and track the number of threads for dask-workers in the autoscaling system. The key changes involve adding support for generic resources (particularly threads) to the Resources model and extending the Dask monitoring configuration.

Add generic resources support to the Resources model with proper arithmetic operations
Introduce DASK_NTHREADS and DASK_NTHREADS_MULTIPLIER configuration settings
Update resource comparison and computation logic to handle the new generic resources
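
As a rough sketch of what "generic resources with arithmetic operations" could look like, the following uses a plain dataclass. It is not the actual `Resources` model from `aws_library`; the class name `ResourcesSketch`, the method `can_host`, and the `"threads"` key are assumptions made for illustration.

```python
import operator
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass(frozen=True)
class ResourcesSketch:
    cpus: float = 0.0
    ram: int = 0  # bytes
    generic_resources: Dict[str, float] = field(default_factory=dict)

    def _combine(
        self, other: "ResourcesSketch", op: Callable[[float, float], float]
    ) -> "ResourcesSketch":
        # A generic resource missing on one side is treated as 0.
        keys = set(self.generic_resources) | set(other.generic_resources)
        merged = {
            k: op(self.generic_resources.get(k, 0), other.generic_resources.get(k, 0))
            for k in keys
        }
        return ResourcesSketch(op(self.cpus, other.cpus), op(self.ram, other.ram), merged)

    def __add__(self, other: "ResourcesSketch") -> "ResourcesSketch":
        return self._combine(other, operator.add)

    def __sub__(self, other: "ResourcesSketch") -> "ResourcesSketch":
        return self._combine(other, operator.sub)

    def can_host(self, required: "ResourcesSketch") -> bool:
        """True when CPUs, RAM and every required generic resource are covered."""
        return (
            self.cpus >= required.cpus
            and self.ram >= required.ram
            and all(
                self.generic_resources.get(name, 0) >= amount
                for name, amount in required.generic_resources.items()
            )
        )


GIB = 1024**3
free = ResourcesSketch(cpus=20, ram=64 * GIB, generic_resources={"threads": 20})
task = ResourcesSketch(cpus=0.1, ram=1 * GIB, generic_resources={"threads": 1})

scheduled = 0
while free.can_host(task):
    free = free - task
    scheduled += 1
print(scheduled)  # stops once the "threads" resource is exhausted
```

In this sketch the loop ends because the thread budget runs out, while CPU and RAM are still available.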

Reviewed Changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

File	Description
services/clusters-keeper/src/simcore_service_clusters_keeper/data/docker-compose.yml	Adds DASK_NTHREADS environment variables to the compose file
services/autoscaling/src/simcore_service_autoscaling/core/settings.py	Introduces new Dask thread configuration settings
packages/aws-library/src/aws_library/ec2/_models.py	Extends Resources model with generic_resources field and related operations
services/autoscaling/src/simcore_service_autoscaling/modules/dask.py	Adds function to compute instance thread resources
services/autoscaling/src/simcore_service_autoscaling/utils/cluster_scaling.py	Updates resource comparison logic
services/autoscaling/tests/unit/test_modules_cluster_scaling_computational.py	Refactors resource mapping logic and updates tests

packages/aws-library/src/aws_library/ec2/_models.py

packages/aws-library/tests/test_ec2_models.py

Copilot

Pull Request Overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 2 comments.

.../autoscaling/src/simcore_service_autoscaling/modules/cluster_scaling/_utils_computational.py

...toscaling/src/simcore_service_autoscaling/modules/cluster_scaling/_provider_computational.py

Copilot

Pull Request Overview

Copilot reviewed 24 out of 24 changed files in this pull request and generated 5 comments.

services/autoscaling/tests/unit/test_utils_cluster_scaling.py

packages/aws-library/src/aws_library/ec2/_models.py

packages/aws-library/tests/test_ec2_models.py

services/autoscaling/src/simcore_service_autoscaling/core/errors.py

packages/aws-library/src/aws_library/ec2/_models.py

Copilot

Pull Request Overview

Copilot reviewed 24 out of 24 changed files in this pull request and generated 7 comments.

Comments suppressed due to low confidence (1)

services/autoscaling/src/simcore_service_autoscaling/modules/dask.py:1

Unconditionally injecting a thread resource of 1 per task in both processing and unrunnable lists duplicates logic and may misrepresent tasks that already define a thread-related generic resource. Centralizing this augmentation (e.g. via a helper that only adds the key if absent) reduces duplication and prevents accidental overwrites.

import collections
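
The suggestion above (add the thread resource only when it is absent) could look roughly like this. The helper name, the `"threads"` key and the default of 1 are hypothetical and not taken from the reviewed file.

```python
from typing import Any, Dict, Mapping

THREADS_KEY = "threads"  # hypothetical key name for the thread resource


def with_default_thread_resource(
    task_resources: Mapping[str, Any], *, default: int = 1
) -> Dict[str, Any]:
    """Return a copy of the task resources with a thread entry added if missing."""
    augmented = dict(task_resources)  # copy, so the caller's mapping is untouched
    augmented.setdefault(THREADS_KEY, default)  # keeps any value already defined
    return augmented


print(with_default_thread_resource({"CPU": 1}))
print(with_default_thread_resource({"CPU": 1, THREADS_KEY: 2}))
```

Calling one such helper for both the processing and the unrunnable task lists would keep the augmentation in a single place.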

services/autoscaling/tests/unit/test_modules_dask.py

services/autoscaling/src/simcore_service_autoscaling/modules/dask.py

.../autoscaling/src/simcore_service_autoscaling/modules/cluster_scaling/_utils_computational.py

services/autoscaling/tests/unit/test_modules_cluster_scaling_computational.py

services/autoscaling/src/simcore_service_autoscaling/modules/dask.py

Copilot

Pull Request Overview

Copilot reviewed 28 out of 28 changed files in this pull request and generated 5 comments.

packages/dask-task-models-library/src/dask_task_models_library/resource_constraints.py

services/autoscaling/src/simcore_service_autoscaling/modules/dask.py

...ces/autoscaling/src/simcore_service_autoscaling/modules/cluster_scaling/_provider_dynamic.py

...toscaling/src/simcore_service_autoscaling/modules/cluster_scaling/_provider_computational.py

services/autoscaling/tests/unit/test_modules_cluster_scaling_computational.py

Copilot

Pull Request Overview

Copilot reviewed 28 out of 28 changed files in this pull request and generated 8 comments.

services/autoscaling/src/simcore_service_autoscaling/core/settings.py

packages/aws-library/src/aws_library/ec2/_models.py

services/autoscaling/src/simcore_service_autoscaling/modules/dask.py

packages/aws-library/src/aws_library/ec2/_models.py

.../autoscaling/src/simcore_service_autoscaling/modules/cluster_scaling/_utils_computational.py

services/autoscaling/src/simcore_service_autoscaling/modules/dask.py

packages/aws-library/src/aws_library/ec2/_models.py

sonarqubecloud · 2025-10-24T14:33:07Z

Quality Gate passed

Issues
0 New issues
2 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

sanderegg added this to the Cheops milestone Sep 25, 2025

sanderegg self-assigned this Sep 25, 2025

sanderegg added a:autoscaling autoscaling service in simcore's stack a:computational clusters labels Sep 25, 2025

sanderegg force-pushed the autoscaling/dask-provider-check-nthreads branch from 1eb7d20 to 9b3ec9f Compare September 25, 2025 11:56

sanderegg requested a review from Copilot September 25, 2025 15:24

Copilot AI reviewed Sep 25, 2025

packages/aws-library/src/aws_library/ec2/_models.py (outdated, resolved)

packages/aws-library/tests/test_ec2_models.py (outdated, resolved)

sanderegg force-pushed the autoscaling/dask-provider-check-nthreads branch 3 times, most recently from ec5ef47 to f68a40d Compare September 26, 2025 14:53

sanderegg requested a review from Copilot September 26, 2025 16:28

Copilot AI reviewed Sep 26, 2025

.../autoscaling/src/simcore_service_autoscaling/modules/cluster_scaling/_utils_computational.py (outdated, resolved)

...toscaling/src/simcore_service_autoscaling/modules/cluster_scaling/_provider_computational.py (outdated, resolved)

sanderegg force-pushed the autoscaling/dask-provider-check-nthreads branch 4 times, most recently from 8b7db41 to 83bf3b0 Compare October 17, 2025 14:37

sanderegg requested a review from Copilot October 19, 2025 20:29

Copilot AI reviewed Oct 19, 2025

sanderegg force-pushed the autoscaling/dask-provider-check-nthreads branch from 2c11b7e to 0dcad8e Compare October 20, 2025 11:55

sanderegg requested a review from Copilot October 20, 2025 11:58

Copilot AI reviewed Oct 20, 2025

sanderegg requested a review from Copilot October 20, 2025 15:40

Copilot AI reviewed Oct 20, 2025

sanderegg requested a review from Copilot October 20, 2025 16:07

Copilot AI reviewed Oct 20, 2025

sanderegg modified the milestones: Cheops, Imparable Oct 21, 2025

sanderegg force-pushed the autoscaling/dask-provider-check-nthreads branch from 32d9139 to cbe908b Compare October 21, 2025 16:29

sanderegg and others added 15 commits October 24, 2025 16:31

added test and a fix

e3b998e

fix code

3bb8fad

adjust ram cpu

a6c322b

Update packages/dask-task-models-library/src/dask_task_models_library…

2e71395

…/resource_constraints.py Co-authored-by: Copilot <[email protected]>

Update services/autoscaling/src/simcore_service_autoscaling/modules/d…

4cc9bb3

…ask.py Co-authored-by: Copilot <[email protected]>

fix return value

ed8fcd0

created a base function to compute resources

e23a789

fixed tests

525954c

pylint

fc09515

linter

a7d975b

fixed tests

c998a25

fixed types

6371b25

Update services/autoscaling/src/simcore_service_autoscaling/modules/c…

c773ac1

…luster_scaling/_utils_computational.py Co-authored-by: Copilot <[email protected]>

Update packages/aws-library/src/aws_library/ec2/_models.py

29d4211

Co-authored-by: Copilot <[email protected]>

@pcrespov review: add some more string comparisons

f2bcb52

sanderegg force-pushed the autoscaling/dask-provider-check-nthreads branch from 7f30e5d to f2bcb52 Compare October 24, 2025 14:31

sanderegg merged commit ffe52c1 into ITISFoundation:master Oct 24, 2025
144 of 148 checks passed

sanderegg deleted the autoscaling/dask-provider-check-nthreads branch October 24, 2025 15:32

sanderegg mentioned this pull request Oct 24, 2025

🐛Autoscaling: fixes unknown passing type to dask-scheduler #8556

Merged

sanderegg added a commit to sanderegg/osparc-simcore that referenced this pull request Oct 24, 2025

Revert "🎨Computational backend: Make sure the number of threads of a …

ce4624b

…dask-worker is computed for autoscaling 🚨🚨🚨 (ITISFoundation#8423)" This reverts commit ffe52c1.

sanderegg mentioned this pull request Oct 24, 2025

🚑️ Revert #8423 + #8556 until later fix and allow for staging release #8557

Merged

sanderegg mentioned this pull request Oct 24, 2025

✨♻️Autoscaling/dask nthreads2 🚨 🚨 🚨 #8558

Merged

sanderegg added a commit that referenced this pull request Oct 24, 2025

🚑️ Revert #8423 + #8556 until later fix and allow for staging release (…

a2ecd48

…#8557)

matusdrobuliak66 mentioned this pull request Oct 28, 2025

🚀 Pre-release master -> staging_Imparable1 #8430

Closed

22 tasks

matusdrobuliak66 mentioned this pull request Oct 31, 2025

🚀 Release v1.87.0 #8432

Closed

56 tasks
